The output ko_abundance.tsv file was open with Excel to create an Excel file (ko_abundance.xlsx) with all three necessary data groups to work with Phyloseq objects: - TPM_df contains the relative abundance of each KO in the sample calculated in TPM. This information is in the ‘TPM’ sheet of the ‘ko_abundance.xlsx’ file. - KO_df contains the KO ids together with its description and symbol. This information is in the ‘KO’ sheet of the ‘ko_abundance.xlsx’ file. - samples_df contains the metadata of each sample. This information is in the ‘metadata’ sheet of the ‘ko_abundance.xlsx’ file.
#Loading the data
setwd("~/Desktop/RyC_TFM/final_results")
#ko_abundance <- read_excel("ko_abundance.xlsx", sheet = "KO")
TPM_df <- read_excel("ko_totalabundance.xlsx", sheet = "TPM")
#ko_abundance <- read_excel("ko_abundance.xlsx", sheet = "TPM")
KO_df <- read_excel("ko_totalabundance.xlsx", sheet = "KO")
samples_df <- read_excel("ko_totalabundance.xlsx", sheet = "metadata")
# We want to add a new samples variable to merge the subjects' Group (Donor, Fecal Microbial Transplant (FMT) or Placebo) with the data collecting Weeks variables
samples_df$group_week <- paste (samples_df$Group, samples_df$week, sep = "_")
A Phyloseq object is created with the data stored in the previously loaded dataframes that are firstly transformed to matrices when needed.
# Define the row names from the KEGG_ko column in the TPM_df
TPM_df <- TPM_df %>%
tibble::column_to_rownames("KEGG_ko")
TPM_df[is.na(TPM_df)] <- 0
# Idem for the other dataframes
KO_df <- KO_df %>%
tibble::column_to_rownames("KEGG_ko")
samples_df <- samples_df %>%
tibble::column_to_rownames("Sample_ID")
#Transform into matrices TPM and KO tables (sample table can be left as data frame)
TPM_mat <- as.matrix(TPM_df)
KO_mat <- as.matrix(KO_df)
class(TPM_mat) <- "numeric" # Because it contains the TPM calculated for each KO. We are going to consider this values as normalized read counts for each KO
# Transform to Phyloseq objects.
# Since PhyloSeq is usually implemented for taxonomy and here we are working with functional annotations, some equivalencies are made:
TPM = otu_table(TPM_mat, taxa_are_rows = TRUE)
KO = tax_table(KO_mat)
samples = sample_data(samples_df)
raw_ps <- phyloseq(TPM, KO, samples)
# We want to normalize the number of normalized reads so it can be compare between samples. For this, we are going to use the median sequencing depth. # Then, since we want read counts, we are going to round this value to have integers.
total = median(sample_sums(raw_ps))
standf = function(x, t=total) round(t * (x / sum(x)))
ps = transform_sample_counts(raw_ps, standf)
# NOTE: The data we are working with does not contain the 'UNMAPPED' functions proportion.
# We want to have the raw phyloseq object which contains normalized read counts (TPM) as integer numbers
round_fun <- function(x) round(x)
raw_ps <- transform_sample_counts(raw_ps, round_fun)
Since PhyloSeq is usually implemented for taxonomy and here we are working with functional annotations, some equivalencies are made:
Taxa in this case is refering to the KOs identifiers. Each KO identifier has a Description, Symbol and again the KO (3 taxonomic ranks) which are stored in the Taxonomy table. The OTU Table corresponds then to the relative abundance of each KO (taxa) in TPM units for each of the 117 samples that have 15 variables stored in the Sample Data section.
ps # Phyloseq Object schema
phyloseq-class experiment-level object
otu_table() OTU Table: [ 7168 taxa and 117 samples ]
sample_data() Sample Data: [ 117 samples by 15 sample variables ]
tax_table() Taxonomy Table: [ 7168 taxa by 3 taxonomic ranks ]
ntaxa(ps) # Number of KO
[1] 7168
nsamples(ps) # Number of samples
[1] 117
sample_names(ps) # All sample names
[1] "R12W8" "R2W0" "R16W24" "543-46G1" "R30W7" "R2W1" "R16W0" "R8W7" "R14W4" "R14W7" "R11W24"
[12] "R30W4" "R20W24" "R24W1" "R20W8" "R14W0" "R18W24" "R4W8" "R2W24X" "R12W24" "R24W7" "R14W1"
[23] "R23W24" "R29W24" "R2W24Y" "R2W4" "R2W7" "0505-0101" "R30W1" "R30W0" "R16W7" "R24W4" "R24W24"
[34] "R8W0" "R26W0" "R18W8" "R10W7" "R20W0" "R14W24" "R4W7" "R18W4" "R24W8" "R20W1" "R6W0"
[45] "R12W1" "R18W7" "R4W4" "R2W8" "R12W0" "R19W24" "R21W24" "R6W7" "R30W8" "R18W0" "R20W4"
[56] "R12W7" "R22W0" "R10W24" "R18W1" "R10W0" "R4W1" "R12W4" "R4W0" "R14W8" "R17W24" "R20W6"
[67] "R25W0" "R8W24" "R1W7" "R17W1" "R3W0" "R9W7" "R1W5" "R29W8" "R17W0" "R7W8" "R30W24"
[78] "R23W8" "R3W7" "R9W0" "R6W24" "R27W0" "R17W7" "R17W4" "585-57G1" "R1W1" "R1W24" "R11W8"
[89] "R1W0" "R19W7" "R7W0" "R17W8" "R29W0" "R23W7" "R11W4" "R29W1" "R13W0" "R7W1" "R21W0"
[100] "R7W24" "R23W4" "R11W7" "R29W5" "R21W7" "R4W24" "R1W8" "R11W0" "R7W4" "R5W0" "R11W1"
[111] "R19W0" "R3W24" "R23W1" "R9W24" "R7W7" "R29W7" "R23W0"
rank_names(ps) # KO's characteristics
[1] "Description" "Symbol" "KO"
sample_variables(ps) # Sample variables
[1] "week" "id" "Group" "Donor"
[5] "Age" "Sex" "AIDS_events" "Nadir.CD4.T.cell"
[9] "CD4..T.cells" "CD4.CD8.ratio" "Antibiotics_.past_6_months" "Antibiotics_past_3_months"
[13] "Completed.Follow.up" "Sample_id" "group_week"
otu_table(ps)[1:5, 1:5] # Overview of the relative abundance table of each KO in TPM
OTU Table: [5 taxa and 5 samples]
taxa are rows
R12W8 R2W0 R16W24 543-46G1 R30W7
K07082 562 485 458 425 487
K08303 1467 647 1406 1344 1519
K01872 1474 839 1228 1083 1229
K02503 444 607 386 449 365
K02238 1135 622 939 804 1246
tax_table(ps)[1:5, 1:2] # Overview of the the characteristics of each KO
Taxonomy Table: [5 taxa by 2 taxonomic ranks]:
Description Symbol
K07082 "UPF0755 protein" "K07082"
K08303 "U32 family peptidase [EC:3.4.-.-]" "prtC, trhP"
K01872 "alanyl-tRNA synthetase [EC:6.1.1.7]" "AARS, alaS"
K02503 "histidine triad (HIT) family protein" "HINT1, hinT, hit"
K02238 "competence protein ComEC" "comEC"
In this step we are selecting the samples we want to work with ¿¿and filtering the KO to eliminate low abundant KOs??
incomplete <- c("R24", "R13", "R22", "R25", "R26", "R27", "R5") # List of id patients with incomplete data
# Here we are selecting data collected at week 0 and 7 from both FMT, placebo and Donor subjects except from subjects with incomplete data
ps_subset <- subset_samples(ps, (week == "Week_0" | week == "Week_7") & !(id %in% incomplete))
ps_subset
phyloseq-class experiment-level object
otu_table() OTU Table: [ 7168 taxa and 45 samples ]
sample_data() Sample Data: [ 45 samples by 15 sample variables ]
tax_table() Taxonomy Table: [ 7168 taxa by 3 taxonomic ranks ]
# Here we are selecting data collected at week 0 and 7 from subjects that received the FMT from the already subset phyloseq object
ps_FMT <- subset_samples(ps_subset, Group=="FMT")
ps_FMT
phyloseq-class experiment-level object
otu_table() OTU Table: [ 7168 taxa and 26 samples ]
sample_data() Sample Data: [ 26 samples by 15 sample variables ]
tax_table() Taxonomy Table: [ 7168 taxa by 3 taxonomic ranks ]
# Here we are selecting data collected at week 0 and 7 from Placebo subjects from the already subset phyloseq object
ps_placebo <- subset_samples(ps_subset, Group=="Placebo")
ps_placebo
phyloseq-class experiment-level object
otu_table() OTU Table: [ 7168 taxa and 16 samples ]
sample_data() Sample Data: [ 16 samples by 15 sample variables ]
tax_table() Taxonomy Table: [ 7168 taxa by 3 taxonomic ranks ]
# Here we are selecting the Donor samples (there is only information collected at week 0)
ps_donor <- subset_samples(ps, Group=="Donor")
ps_raw_subset <- subset_samples(raw_ps, (week == "Week_0" | week == "Week_7") & !(id %in% incomplete))
ps_raw_FMT <- subset_samples(ps_raw_subset, Group=="FMT")
ps_raw_placebo <- subset_samples(ps_raw_subset, Group=="Placebo")
ps_raw_week0 <- subset_samples(ps_raw_subset, (Group == "Placebo" | Group == "FMT") & week == "Week_0")
ps_raw_week7 <- subset_samples(ps_raw_subset, (Group == "Placebo" | Group == "FMT") & week == "Week_7")
[1] "K06147" "K21572" "K02004" "K03088" "K01190" "K07133" "K02003" "K05349" "K01990" "K03205" "K03169" "K01992" "K02529" "K02355" "K01187"
[16] "K03655" "K03296" "K03497" "K02469" "K04759" "K03701" "K03046" "K01915" "K03043" "K12132" "K02030" "K03496" "K02035" "K03798" "K03737"
[31] "K01955" "K03657" "K01897" "K02470" "K01153" "K03310" "K08303" "K16787" "K00266" "K03654" "K02337" "K01154" "K01952" "K00688" "K03308"
[46] "K00615" "K02014" "K00558" "K02029" "K06180"
plot <- plot_bar(ps_top50, fill = "KO") +
geom_bar(aes(color=KO, fill=KO), stat="identity", position="fill") +
labs(x="Samples", y="Abundance", title="The 50 most abundant KO") +
facet_wrap(~week, scales="free_x") +
theme(axis.text.x = element_text(size = 15),
axis.text.y = element_text(size = 20),
axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
legend.title = element_text(size = 20),
legend.text = element_text(size = 15),
strip.text = element_text(size = 20),
plot.title = element_text(size = 30))
plot
ps_top50_relab = phyloseq::transform_sample_counts(ps_top50, function(x){(x / sum(x))})
plot <- plot_bar(ps_top50_relab, fill = "KO") +
geom_bar(aes(color=KO, fill=KO), stat="identity", position="fill") +
labs(x="Samples", y="Relative abundance between Top50", title="The 50 most abundant KO") +
facet_wrap(~week, scales="free_x") +
theme(axis.text.x = element_text(size = 15),
axis.text.y = element_text(size = 20),
axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
legend.title = element_text(size = 20),
legend.text = element_text(size = 15),
strip.text = element_text(size = 20),
plot.title = element_text(size = 30))
plot
Next, we are going to calculate the Alpha Diversity of our samples. Definition of alpha diversiy and estimates we are using
To do so, we first need to add the diversity estimates to our phyloseq object, and test if the data are normally distributed using Shapiro-Wilk Normality test. In this case, will focus on the Shannon, Chao1 and Observed diversity estimates.
You want the data to be roughly normal so that you can run ANOVA or t-tests. If it is not normally distributed, you will need to consider non-parametric tests such as Kruskal-Wallis.
Overall, for alpha-diversity:
ANOVA, t-test, or general linear models with the normal distribution are used when the data is roughly normal Kruskal-Wallis, Wilcoxon rank sum test, or general linear models with another distribution are used when the data is not normal.
First, we need to extract the estimates of the alpha diversity and then perform a Shapiro-Wilk Normality test to see if this data follows a normal distribution. The null hypothesis of this test is that “sample distribution is normal”. If the test is significant, the distribution is non-normal.
# Extract Shannon results
alpha_results <- estimate_richness(ps_subset, measures = c("Shannon", "Observed", "Chao1"))
Warning in estimate_richness(ps_subset, measures = c("Shannon", "Observed", :
The data you have provided does not have
any singletons. This is highly suspicious. Results of richness
estimates (for example) are probably unreliable, or wrong, if you have already
trimmed low-abundance taxa from the data.
We recommended that you find the un-trimmed data and retry.
# Shapiro-Wilk Normality test.
# The null hypothesis of this test is that “sample distribution is normal”. If the test is significant, the distribution is non-normal.
shapiro.test(alpha_results$Shannon)
Shapiro-Wilk normality test
data: alpha_results$Shannon
W = 0.96052, p-value = 0.1279
shapiro.test(alpha_results$Observed)
Shapiro-Wilk normality test
data: alpha_results$Observed
W = 0.97096, p-value = 0.3136
shapiro.test(alpha_results$Chao1)
Shapiro-Wilk normality test
data: alpha_results$Chao1
W = 0.97096, p-value = 0.3136
From the output, the p-value > 0.05 implying that the distribution of the data are not significantly different from normal distribution. In other words, we can assume the normality for all three alpha diversity estimates
We can plot the Shannon Diversity of each sample group and test its significance with t-tests.
We are going to normalize the raw phyloseq object using DESeq2.
set.seed(0503)
# First we need to convert the raw phyloseq object to a deseq object
ps_dds <- phyloseq_to_deseq2(ps_raw_subset, ~Group + week )
converting counts to integer mode
Warning in DESeqDataSet(se, design = design, ignoreRank) :
some variables in design formula are characters, converting to factors
# and running DEEeq2 standard analysis:
ps_dds <- DESeq(ps_dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 947 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
# geometric mean, set to zero when all coordinates are zero
geo_mean_protected <- function(x) {
if (all(x == 0)) {
return (0)
}
exp(mean(log(x[x != 0])))
}
geoMeans <- apply(counts(ps_dds), 1, geo_mean_protected)
ps_dds <- estimateSizeFactors(ps_dds, geoMeans = geoMeans)
ps_dds <- estimateDispersions(ps_dds)
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
abund <- getVarianceStabilizedData(ps_dds)
# making our phyloseq object with transformed table
vst_count_phy <- otu_table(abund, taxa_are_rows=T)
sample_info_tab_phy <- sample_data(samples_df)
vst_physeq <- phyloseq(vst_count_phy, sample_info_tab_phy)
Too few points to calculate an ellipse
Too few points to calculate an ellipse
Too few points to calculate an ellipse
Too few points to calculate an ellipse
Betadisper and permutational ANOVA
Here we are going to test if there is a statistically signficant difference between our sample types. One way to do this is with the betadisper and adonis functions from the vegan package. adonis can tell us if there is a statistical difference between groups. However, adonis assumes that there is no statistical difference within the same group. To check whether this assumption is met, we first need to perfom a betadisper permutest. If the p-value is non-significant (p-value > 0.05), there is a sufficient level of homogeneity of dispersion within groups. If there is not, then the adonis permanova test can be unreliable.
betadisper(d = bray_dist, group = sample_data(vst_physeq)$Group)
Permutation test for homogeneity of multivariate dispersions
Permutation: free
Number of permutations: 1000
Response: Distances
Df Sum Sq Mean Sq F N.Perm Pr(>F)
Groups 2 2.0650e-06 1.0326e-06 0.2728 1000 0.7702
Residuals 42 1.5897e-04 3.7849e-06
Pairwise comparisons:
(Observed p-value below diagonal, permuted p-value above diagonal)
Donor FMT Placebo
Donor 0.55045 0.7193
FMT 0.55631 0.6064
Placebo 0.71780 0.60870
The p-value of the betadisper permutest is higher than 0.05 so we can assume that sample groups are sufficiently homogenous and we can carry on with the adonis permanova test.
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
adonis2(formula = bray_dist ~ sample_data(vst_physeq)$Group)
Df SumOfSqs R2 F Pr(>F)
sample_data(vst_physeq)$Group 2 0.00016144 0.06964 1.5718 0.049 *
Residual 42 0.00215691 0.93036
Total 44 0.00231836 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The p-value of the Adonis permanova test is lower than 0.05 so there is a statistically significant difference between groups.
betadisper(d = bray_dist, group = sample_data(vst_physeq)$group_week)
Permutation test for homogeneity of multivariate dispersions
Permutation: free
Number of permutations: 1000
Response: Distances
Df Sum Sq Mean Sq F N.Perm Pr(>F)
Groups 4 1.0946e-05 2.7366e-06 0.7654 1000 0.5475
Residuals 40 1.4301e-04 3.5754e-06
Pairwise comparisons:
(Observed p-value below diagonal, permuted p-value above diagonal)
Donor_Week_0 FMT_Week_0 FMT_Week_7 Placebo_Week_0 Placebo_Week_7
Donor_Week_0 0.41758 0.94505 0.94705 0.6054
FMT_Week_0 0.40574 0.17283 0.16484 0.6094
FMT_Week_7 0.95017 0.17571 0.82418 0.4585
Placebo_Week_0 0.94825 0.17899 0.84344 0.4196
Placebo_Week_7 0.62116 0.61545 0.47506 0.41651
The p-value of the betadisper permutest is higher than 0.05 so we can assume that sample groups are sufficiently homogenous and we can carry on with the adonis permanova test.
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
adonis2(formula = bray_dist ~ sample_data(vst_physeq)$group_week)
Df SumOfSqs R2 F Pr(>F)
sample_data(vst_physeq)$group_week 4 0.00024664 0.10639 1.1905 0.168
Residual 40 0.00207171 0.89361
Total 44 0.00231836 1.00000
The p-value of the Adonis permanova test is higher than 0.05 so we can’t prove that there is a statistically significant difference between groups.
betadisper(d = bray_dist, group = sample_data(ps_subset)$Group)
Permutation test for homogeneity of multivariate dispersions
Permutation: free
Number of permutations: 1000
Response: Distances
Df Sum Sq Mean Sq F N.Perm Pr(>F)
Groups 2 0.000661 0.00033064 0.1124 1000 0.8971
Residuals 42 0.123594 0.00294271
Pairwise comparisons:
(Observed p-value below diagonal, permuted p-value above diagonal)
Donor FMT Placebo
Donor 0.95005 0.8631
FMT 0.94021 0.6094
Placebo 0.86006 0.62856
The p-value of the betadisper permutest is higher than 0.05 so we can assume that sample groups are sufficiently homogenous and we can carry on with the adonis permanova test.
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
adonis2(formula = bray_dist ~ sample_data(ps_subset)$Group)
Df SumOfSqs R2 F Pr(>F)
sample_data(ps_subset)$Group 2 0.11267 0.08784 2.0223 0.015 *
Residual 42 1.17000 0.91216
Total 44 1.28267 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The p-value of the Adonis permanova test is lower than 0.05 so there is a statistically significant difference between groups.
betadisper(d = bray_dist, group = sample_data(ps_subset)$group_week)
Permutation test for homogeneity of multivariate dispersions
Permutation: free
Number of permutations: 1000
Response: Distances
Df Sum Sq Mean Sq F N.Perm Pr(>F)
Groups 4 0.007034 0.0017585 0.622 1000 0.6503
Residuals 40 0.113088 0.0028272
Pairwise comparisons:
(Observed p-value below diagonal, permuted p-value above diagonal)
Donor_Week_0 FMT_Week_0 FMT_Week_7 Placebo_Week_0 Placebo_Week_7
Donor_Week_0 0.76224 0.72028 0.56444 0.9720
FMT_Week_0 0.74426 0.22178 0.15684 0.6354
FMT_Week_7 0.71613 0.23197 0.84316 0.5135
Placebo_Week_0 0.57230 0.16878 0.83262 0.3467
Placebo_Week_7 0.96874 0.65043 0.52312 0.35655
The p-value of the betadisper permutest is higher than 0.05 so we can assume that sample groups are sufficiently homogenous and we can carry on with the adonis permanova test.
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
adonis2(formula = bray_dist ~ sample_data(ps_subset)$group_week)
Df SumOfSqs R2 F Pr(>F)
sample_data(ps_subset)$group_week 4 0.16104 0.12555 1.4357 0.064 .
Residual 40 1.12163 0.87445
Total 44 1.28267 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The p-value of the Adonis permanova test is higher than 0.05 so we can’t prove that there is a statistically significant difference between groups.
Find features that are up or down represented in the compared groups
using the microbial package that implements DESeq2. - The difftest
function from the microbial package finds features that are up or down
represented in the compared groups. The difftest function requires a
phyloseq object containing merged information of abundance, taxonomic
assignment and sample data including the measured variables and
categorical information of the samples. Since the DESeq2 transformation
is performed in this function, RAW COUNT values are preferred.
In this context, we will be using the ps_raw phyloseq object and its
subsets.
- The plotdiff function produces a figure with the top most annotated
features with corresponding adjusted p-values and abundance distribution
with respect to control conditions.
The group parameter is a character string specifying the name of a categorical variable containing grouping information. The pvalue and log2FC are thresholds for p values and log2 fold change. Adjusted p value cutoff is also supported by specify the padj paramater.
converting counts to integer mode
using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 1255 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
converting counts to integer mode
using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 873 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
converting counts to integer mode
using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 1153 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
converting counts to integer mode
using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 962 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing